An explainable machine learning model for prediction of high-risk nonalcoholic steatohepatitis

Early identification of high-risk metabolic dysfunction-associated steatohepatitis (MASH) can offer patients access to novel therapeutic options and potentially decrease the risk of progression to cirrhosis. This study aimed to develop an explainable machine learning model for high-risk MASH prediction and compare its performance with well-established biomarkers. Data were derived from the National Health and Nutrition Examination Surveys (NHANES) 2017-March 2020, which included a total of 5281 adults with valid elastography measurements. We used a FAST score ≥ 0.35, calculated using liver stiffness measurement and controlled attenuation parameter values and aspartate aminotransferase levels, to identify individuals with high-risk MASH. We developed an ensemble-based machine learning XGBoost model to detect high-risk MASH and explored the model’s interpretability using an explainable artificial intelligence SHAP method. The prevalence of high-risk MASH was 6.9%. Our XGBoost model achieved a high level of sensitivity (0.82), specificity (0.91), accuracy (0.90), and AUC (0.95) for identifying high-risk MASH. Our model demonstrated a superior ability to predict high-risk MASH vs. FIB-4, APRI, BARD, and MASLD fibrosis scores (AUC of 0.95 vs. 0.50, 0.50, 0.49 and 0.50, respectively). To explain the high performance of our model, we found that the top 5 predictors of high-risk MASH were ALT, GGT, platelet count, waist circumference, and age. We used an explainable ML approach to develop a clinically applicable model that outperforms commonly used clinical risk indices and could increase the identification of high-risk MASH patients in resource-limited settings.


Development of a machine learning model using XGBoost
We used the eXtreme Gradient Boosting (XGBoost) algorithm to develop our machine learning model. The data were split into three independent cohorts of patients with approximately equal proportions of subjects with high-risk MASLD: one for training the model (training set), another for validation during hyperparameter optimization (validation set), and a test or holdout set (test set) to evaluate the prediction performance. We performed 100 iterations of model training with hyperparameter optimization to maximize the harmonic mean of sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), a custom performance metric designed to reward high values across all individual metrics and to penalize models in which any single metric is low.
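The custom optimization target can be illustrated with a minimal sketch. It assumes the metric is the plain, unweighted harmonic mean of the four values computed from confusion-matrix counts; the function names and the example counts are hypothetical and are not taken from the study code.

```python
from statistics import harmonic_mean


def confusion_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, PPV, and NPV from confusion-matrix counts."""
    def ratio(num, den):
        # Treat an undefined ratio (empty denominator) as 0 so it is penalized.
        return num / den if den else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def harmonic_score(tp, fp, tn, fn):
    """Harmonic mean of the four metrics; one low metric pulls the score down."""
    return harmonic_mean(list(confusion_metrics(tp, fp, tn, fn).values()))


# Hypothetical counts for illustration only.
metrics = confusion_metrics(tp=50, fp=30, tn=900, fn=20)
score = harmonic_score(tp=50, fp=30, tn=900, fn=20)
arithmetic = sum(metrics.values()) / len(metrics)
```

Because the harmonic mean is dominated by its smallest term, `score` is lower than `arithmetic` whenever the four metrics differ, which is the property that makes it useful for discouraging models that trade one metric away for the others.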
After training, we tested the model on the independent test set. The evaluation metrics we used included the area under the receiver operating characteristic curve (AUROC), accuracy, sensitivity, specificity, PPV, and NPV. To estimate the 95% confidence intervals (CI) for these metrics, we used a bootstrapping method with 1000 iterations. We also performed k-fold cross-validation to evaluate model robustness and to reduce bias from data partitioning. Details about model training, hyperparameter optimization, and strategies to minimize overfitting are provided in Supplementary Methods 2.
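A bootstrap confidence interval of this kind can be sketched as follows. The text does not state which bootstrap variant was used, so the percentile method is an assumption here, and the labels and predictions are synthetic placeholders rather than study data.

```python
import random


def sensitivity(y_true, y_pred):
    """Fraction of true positives among all actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else float("nan")


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample subjects with replacement n_boot times."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if value == value:  # skip NaN from resamples with no positives
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Synthetic example: 50 positives (40 detected) and 150 correctly classified negatives.
y_true = [1] * 50 + [0] * 150
y_pred = [1] * 40 + [0] * 10 + [0] * 150
point = sensitivity(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred, sensitivity)
```

The same `bootstrap_ci` helper accepts any metric function with the signature `metric(y_true, y_pred)`, so specificity, PPV, and NPV intervals follow from the same resampling loop.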

Shapley additive explanations
To facilitate the interpretation of our XGBoost classification model, we used the Shapley Additive Explanations (SHAP) approach 16,17 to determine the contribution of each predictor, known as a feature in computer science, toward the final prediction of high-risk MASLD. SHAP values provide a measure of predictor importance that accounts for both the individual feature values and their interactions with other features based on their impact on the ultimate prediction.
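The idea behind Shapley values can be shown with a brute-force sketch that averages each feature's marginal contribution over every coalition of the other features. This is only an illustration: the value function `toy_value` and its numbers are hypothetical, and practical SHAP implementations for tree models use a far more efficient algorithm and define the value function from the trained model's expected output.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating all coalitions (only feasible for few features)."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


def toy_value(coalition):
    """Hypothetical model score given which features are 'known'."""
    score = 0.0
    if "ALT" in coalition:
        score += 2.0
    if "GGT" in coalition:
        score += 1.0
    if "ALT" in coalition and "GGT" in coalition:
        score += 1.0  # interaction term, shared between ALT and GGT
    if "platelets" in coalition:
        score -= 1.5
    return score


features = ["ALT", "GGT", "platelets"]
phi = shapley_values(toy_value, features)
# phi is approximately {"ALT": 2.5, "GGT": 1.5, "platelets": -1.5}
```

In this toy case the interaction term of 1.0 is split evenly between ALT and GGT, and the contributions sum to the difference between the score with all features and the score with none, which is the additivity property that SHAP relies on.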

Missing data handling
We did not censor patients with missing data or impute missing data to train and evaluate the XGBoost models because of their ability to learn and generate predictions despite data missingness. However, other model algorithms (e.g., logistic regression and random forest) are unable to handle missing data, and thus we used K-nearest-neighbor (KNN) imputation 18 and compared performance between XGBoost, logistic regression, and random forest on a KNN-imputed dataset, so that diagnostic performance was compared across models trained on the same dataset (Supplementary Table 1).
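For readers unfamiliar with KNN imputation, the following sketch shows the basic mechanism: a missing entry is replaced by the mean of that feature among the most similar rows. The number of neighbors, the distance definition, and the toy data are assumptions for illustration; the text does not report these settings, and library implementations (for example, the `KNNImputer` class in scikit-learn) would normally be used in practice, typically after feature scaling.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Distance is Euclidean over the features observed in both rows, rescaled by
    the share of features compared so sparse overlaps are not favored.
    """
    n_features = len(rows[0])

    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        squared = sum((x - y) ** 2 for x, y in shared)
        return math.sqrt(squared * n_features / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows in which feature j is observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if d[0] < math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Toy data: the last row is missing its second feature.
rows = [[1.0, 10.0], [1.1, 12.0], [5.0, 50.0], [1.05, None]]
filled = knn_impute(rows, k=2)
# filled[3][1] is 11.0, the mean of the two closest rows' values (10.0 and 12.0)
```

By contrast, gradient-boosted tree implementations such as XGBoost learn a default branch direction for missing values at each split, which is why no imputation step is needed for those models.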

Statistical analysis
We used χ² and t-tests to compare patient characteristics between the high-risk MASLD outcome groups in the complete set as well as in the training and test sets. We included demographics, clinical and laboratory information, the number of participants, and missing data. Additionally, we report mean differences between high-risk NASH groups. For continuous variables, we used the

NCHS ethics statement
The research presented in this paper adheres to the ethical principles and guidelines set forth by the National Center for Health Statistics (NCHS). The NCHS is committed to ensuring the rights, welfare, and privacy of individuals participating in research studies. The NCHS ethics statement underscores our commitment to upholding the highest ethical standards in research. By adhering to these principles, we aim to contribute to the advancement of knowledge while ensuring the protection and well-being of all participants involved in our study.

Subject characteristics
There was a total of 5156 subjects meeting the inclusion criteria. The prevalence of high-risk MASLD at FAST ≥ 0.35 and FAST ≥ 0.67 was 5.8% and 1.1%, respectively. The median age was 55 years (IQR 37-67 years), and 2490 (48%) of all subjects were women (Table 1). There were more men than women in the high-risk MASLD group (67.5%, p < 0.001) and more Hispanic individuals (12.6%, p < 0.001) in the high-risk MASLD group compared to the no high-risk MASLD group (9.7%). The high-risk MASLD group had a higher prevalence of diabetes in their medical history (27.2%, p < 0.001) compared to the no high-risk MASLD group. Physical exam results showed higher body mass index (BMI, median 34 kg/m², p < 0.001) and waist circumference (median 113 cm, p < 0.001) measurements for the high-risk MASLD subjects. In terms of laboratory results, the high-risk MASLD group had higher liver enzymes (median AST, ALT, GGT of 36, 46, 43 U/L, respectively, p < 0.001), lower platelet counts (median 219 × 10³ cells/μL), and higher hemoglobin (Hb)A1c (median 6.0, p < 0.001), plasma glucose (median 117 mg/dL, p < 0.001), and insulin (20.4 μU/mL, p < 0.001). Additionally, the high-risk MASLD subjects had lower levels of HDL (median 43 mg/dL, p < 0.001). A complete comparison of all 127 predictors used to develop the exploratory ML models (i.e., prior to selecting the top 5 predictors to fine-tune subsequent models) can be found in Supplementary Data.
XGBoost was the top-performing model in 3 of the 4 comparisons, followed by LR and RF both with 2 out of 4 (Supplementary Table 1).

Explaining cohort- and patient-level predictions by XGBoost MASLD models
We conducted SHAP for tree-based models to evaluate how the predictors used to train the models influenced the predictions made by XGBoost MASLD models (Fig. 2; Supplementary Figs. 5-6). A prediction of high-risk MASLD was more likely when SHAP > 0 (i.e., prediction probability, P_pred ≥ 0.50), and a prediction of no high-risk MASLD was more likely when SHAP < 0 (P_pred < 0.50).

Explaining cohort-level XGBoost MASLD predictions
In descending order of impact on predictions by XGBoost MASLD FAST≥0.35 trained on the top 5 predictors, the predictors were ALT, BMI, GGT, age, and platelet count. ALT, BMI, GGT, and age had a positive impact on high-risk MASLD prediction at FAST ≥ 0.35, whereas platelet count had a negative impact on high-risk MASLD prediction (i.e., positive impact on no high-risk MASLD prediction at FAST ≥ 0.35). Further, predictions by XGBoost MASLD FAST≥0.67 were influenced first by ALT, followed by BMI, GGT, platelet count, and lastly age. ALT, BMI, and GGT had a positive impact on high-risk MASLD prediction at FAST ≥ 0.67, platelet count had a negative impact, and any age had a negative impact on predictions. We show the top 20 predictor contributions for XGBoost MASLD models trained on all 127 predictors (Supplementary Figs. 5-6). ALT, GGT, and BMI were the top 3 predictors for both XGBoost MASLD FAST≥0.35 and XGBoost MASLD FAST≥0.67. Platelet count was in the top 20 features of both models, whereas age was only in the top 20 for XGBoost MASLD FAST≥0.35.

Distribution of predictor values, model contribution, and prediction accuracy
We compared the predictor-specific SHAP values (i.e., contribution to model prediction by that unique predictor) and their corresponding predictor values for each subject in the test set (Supplementary Fig. 7). Generally, there was a positive correlation between ALT, GGT, and BMI and their SHAP values, a negative correlation between platelet count and its SHAP values, and a positive correlation between age and its corresponding SHAP values for XGBoost MASLD FAST≥0.35 only. We display the few false positive (orange) and false negative predictions (green), and further investigate unique cases of inaccurate predictions in Fig. 3.

Explaining patient-level XGBoost MASLD predictions
We describe four cases of unique subjects and their corresponding local SHAP values showing the contribution of ALT, GGT, platelets, age, and BMI to the predictions by XGB MASLD FAST≥0.35 (Fig. 3).
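Local SHAP explanations are additive: the model output f(x) for one subject equals the baseline expected value E[f(X)] plus the sum of that subject's feature contributions. The sketch below assumes, as is typical for tree-based binary classifiers, that these quantities are expressed in log-odds units and converted to a probability with the logistic function. All numbers are hypothetical and are not read from Fig. 3.

```python
import math


def predicted_probability(base_value, shap_values):
    """Add the baseline and per-feature SHAP contributions (log-odds), then map to a probability."""
    log_odds = base_value + sum(shap_values.values())
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical subject: elevated ALT and GGT push toward high risk, platelets push away.
base_value = -2.0  # assumed E[f(X)] in log-odds
local_shap = {"ALT": 1.6, "GGT": 0.7, "BMI": 0.4, "age": 0.1, "platelets": -0.3}
probability = predicted_probability(base_value, local_shap)  # log-odds 0.5, probability about 0.62
```

Under this assumption, a model output above zero corresponds to a predicted probability above 0.50, and the sign of each individual SHAP value indicates whether that predictor moved the subject toward or away from a high-risk prediction.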

Discussion
In this observational study, we developed an ensemble-based machine learning model using XGBoost to detect individuals in the U.S. population with high-risk MASLD and explored cohort- and subject-level prediction interpretability using explainable artificial intelligence with SHAP analysis. This application of explainable machine learning for the identification of high-risk MASLD contributes to the growing body of research in this area. To the best of our knowledge, this is the first nationwide application of explainable machine learning for the identification of high-risk MASLD. While traditional machine learning applications have shown promise in detecting MASLD, MASH, and advanced fibrosis, our explainable XGBoost MASLD model demonstrates unique functionalities not previously reported in the literature, including: (1) the ability to learn despite missing data, (2) the use of five easily accessible patient features (ALT, GGT, platelet count, age, and BMI), and (3) prediction explanation with feature-specific SHAP values.
Although prior applications of traditional machine learning to detect MASLD, MASH, and advanced fibrosis have produced promising results, we showed that our explainable XGBoost MASLD model outperformed previous traditional ML models in detecting high-risk MASLD. Compared to Wu et al. 19, Docherty et al. 4, and Ghandian et al. 5, who reported sensitivities up to 0.82 and specificities up to 0.79, our model achieved comparable sensitivity (mean [95% CI], 0.71 [0.59-0.83]), but higher specificity (0.97 [0.96-0.98]), accuracy (0.95 [0.94-0.97]), and AUROC (0.95 [0.91-0.97]). Additionally, we report a good PPV of 0.59 [0.47-0.70] and an excellent NPV of 0.98 [0.97-0.99]. This is a significant improvement in performance, which can help to improve the diagnosis and management of high-risk MASLD patients in resource-limited settings. It is important to acknowledge, however, that the natural history of MASLD is complex and the terminology describing disease stages is evolving. Additionally, our study's endpoints might differ from those in previous studies, which could impact comparative assessments. Therefore, our findings contribute to a broader understanding of the potential of machine learning in this evolving field and must be interpreted within this context. Furthermore, we also compared the performance of serologic biomarkers and the XGBoost MASLD model in identifying high-risk MASLD patients. Although FIB4, NFS, and APRI had higher sensitivity and NPV, XGBoost MASLD had a good sensitivity of 0.77 and an excellent NPV of 0.99 in the test set. In addition, XGBoost MASLD had the highest specificity and PPV at 0.97 and 0.61, while other serologic biomarkers did not surpass 0.14 and 0.06, respectively. These results suggest that our XGBoost model can serve as a promising tool for identifying high-risk MASLD patients.
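The reported predictive values can be related to sensitivity, specificity, and prevalence through Bayes' rule. The short check below uses the sensitivity (0.71) and specificity (0.97) reported above and assumes the prevalence in the evaluated sample equals the overall 5.8% prevalence at FAST ≥ 0.35; that equality is an assumption made for illustration, not a statement from the paper.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by sensitivity, specificity, and prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


ppv, npv = predictive_values(sensitivity=0.71, specificity=0.97, prevalence=0.058)
# ppv is about 0.59 and npv is about 0.98
```

Under that assumption the implied values agree with the reported PPV of 0.59 and NPV of 0.98, and the calculation shows why, at a prevalence near 6%, even a specificity of 0.97 yields a moderate PPV while the NPV remains very high.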
Our findings reveal that the subject characteristics in our study are consistent with the classical clinical manifestation of the MASLD-MASH disease spectrum. We observed that the high-risk MASLD group had higher rates of obesity, type 2 diabetes, and metabolic syndrome, which includes hypertriglyceridemia, low HDL-C, and abdominal obesity 20, when compared to the non-high-risk group. Additionally, the high-risk MASLD group demonstrated elevated transaminases in the absence of heavy alcohol consumption and no history of hepatitis B or C, which aligns with the diagnostic criteria for MASLD. These results are consistent with previous studies that have identified obesity and type 2 diabetes as significant risk factors for the development of MASLD 21.
This study had several limitations. First, although the optimized XGBoost model was validated using a test (holdout) set, it would need to be further prospectively and externally validated before widespread adoption. Second, we did not have data on liver biopsy for gold-standard comparison. Third, the adoption of the optimized model in clinical practice as well as its integration into electronic medical records will need to be evaluated in future studies. Fourth, as with any machine learning modality, possible "overfitting" is a significant limitation. To address this, we applied the following overfitting mitigation strategies: (1) using 3 sets of data partitioning to validate the model while training and a separate test (holdout) set for internal validation, (2) hyperparameter optimization through 100 iterations of XGBoost MASLD models with unique combinations of regularization and subsampling parameters, (3) early stopping in model training, (4) balanced weighting to avoid overfitting on a highly prevalent class, and (5) internal validation strategies including k-fold cross-validation and bootstrapping metrics on the test set. Finally, a major limitation that warrants discussion is the dependency of the XGBoost model performance on the training cohort characteristics. This raises important concerns about the generalizability of these models in other settings and populations. The difficulty in direct comparison of ML models unless developed and validated on similar cohorts underscores the necessity for cautious application and interpretation of our findings across different clinical environments. This limitation highlights the importance of ongoing evaluation and adaptation of ML models in diverse settings to ensure their relevance and efficacy. Given these concerns about generalizability, future research should use an external cohort for validation, which was outside the scope of this work but would substantially support the generalizability of the conclusions.
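Two of the mitigation strategies listed above, outcome-balanced data partitioning and balanced class weighting, can be sketched in a few lines. The split fractions (60/20/20), the helper names, and the synthetic labels are assumptions for illustration; the paper does not report the partition sizes, and whether the weighting was passed through XGBoost's `scale_pos_weight` parameter (commonly set to the negative-to-positive ratio) is not stated.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return index lists (train, validation, test) with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    splits = [[] for _ in fractions]
    for indices in by_class.values():
        rng.shuffle(indices)
        n = len(indices)
        start = 0
        for s, frac in enumerate(fractions):
            # The last split takes whatever remains so every index is used once.
            end = n if s == len(fractions) - 1 else start + round(frac * n)
            splits[s].extend(indices[start:end])
            start = end
    return splits


def positive_class_weight(labels):
    """Ratio of negatives to positives, a common weight for the minority class."""
    positives = sum(1 for y in labels if y == 1)
    return (len(labels) - positives) / positives


# Synthetic labels with a 6% positive rate, close to the prevalence reported in the study.
labels = [1] * 60 + [0] * 940
train_idx, valid_idx, test_idx = stratified_split(labels)
weight = positive_class_weight([labels[i] for i in train_idx])
```

Splitting within each outcome class keeps the rare high-risk group represented at about the same rate in all three partitions, and weighting positives by the negative-to-positive ratio keeps the majority class from dominating the training loss.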

Conclusion
In conclusion, our study demonstrates the potential of explainable machine learning in the detection of high-risk MASH. The development of an XGBoost model that outperforms well-established serologic tests has shown the ability of machine learning to detect high-risk MASH in a more comprehensive and flexible manner. The high complexity of our model allows for the detection of heterogeneous subphenotypes, a feature not present in most serologic tests. While the pathophysiology of liver fibrosis in MASH is complex and variable, our model has proven successful in classification. These findings suggest that a more multidisciplinary approach that incorporates machine learning may lead to improved diagnosis and management of patients with MASH, ultimately optimizing clinical outcomes. Further studies are needed to explore the clinical applications of our proposed XGBoost model in identifying high-risk MASH patients. If externally validated, our explainable ML model could be used to increase the identification of high-risk MASH patients in resource-limited settings.

Figure 2. Impact of predictors on high-risk MASLD prediction using SHAP values. Summary Shapley Additive Explanations (SHAP) plot shows the importance and impact of various training variables (predictors) on XGB MASLD FAST≥0.35. The SHAP values on the x-axis quantify the influence of each predictor, with positive values favoring high-risk MASLD prediction and negative values favoring no high-risk MASLD prediction. The predictors, including ALT, BMI, GGT, age, and platelet count, are ranked by magnitude of impact. Predictor values are color-coded, with red indicating higher values and blue indicating lower values (e.g., ALT of 120 U/L in red and 12 U/L in blue).

Figure 3. Patient-Specific SHAP Value Impact on XGB MASLD FAST≥0.35 Predictions. Four cases of patients and their corresponding local SHAP values showing the contribution of ALT, GGT, platelets, age, and BMI to the predictions by XGB MASLD FAST≥0.35. Upper panel shows correct high-risk MASLD (top left) and no high-risk MASLD (top right) predictions. Bottom panel shows incorrect predictions, including a false negative (bottom left) and false positive (bottom right). Red and blue bars indicate attributes driving the model to predict high-risk MASLD and no high-risk MASLD, respectively. Each subplot provides specific patient data points, their contribution to the model output (f(x)), and the baseline expected value (E[f(X)]) of the model.

Table 1. Select clinical and laboratory characteristics of adults older than 18 years with high-risk and no high-risk MASLD with acceptable FibroScan® data in the National Health and Nutrition Examination Survey between 2017 and March 2020. High-risk MASLD was defined as a FAST score ≥ 0.35. P-values were calculated using appropriate statistical tests based on distribution and variance. Continuous variables are shown as median [IQR]; categorical variables as count (%). MD, mean difference.